Abstract

Many models in natural language processing define probabilistic
distributions over linguistic structures. We argue that (1) the quality
of a model’s posterior distribution can and should be directly evaluated,
as to whether probabilities correspond to empirical frequencies; and (2)
NLP uncertainty can be projected not only to pipeline components, but
also to exploratory data analysis, telling a user when to trust and not
trust the NLP analysis. We present a method to analyze calibration, and
apply it to compare the miscalibration of several commonly used models.
We also contribute a coreference sampling algorithm that can create
confidence intervals for a political event extraction task. 1

1 Introduction 

Natural language processing systems are imper¬ 
fect. Decades of research have yielded analyzers 
that mis-identify named entities, mis-attach syn¬ 
tactic relations, and mis-recognize noun phrase 
coreference anywhere from 10-40% of the time. 
But these systems are accurate enough so that their 
outputs can be used as soft, if noisy, indicators of 
language meaning for use in downstream analysis, 
such as systems that perform question answering, 
machine translation, event extraction, and narra¬ 
tive analysis (McCord et al., 2012; Gimpel and 
Smith, 2008; Miwa et al., 2010; Bamman et al., 
2013). 

To understand the performance of an analyzer, researchers and
practitioners typically measure the accuracy of individual labels or
edges among a single predicted output structure y, such as a
most-probable tagging or entity clustering arg max_y P(y | x)
(conditional on text data x). But a probabilistic model gives a
probability distribution over many other output structures that have
smaller predicted probabilities; a line of work has sought to control
cascading pipeline errors by passing on multiple structures from earlier
stages of analysis, by propagating prediction uncertainty through
multiple samples (Finkel et al., 2006), K-best lists (Venugopal et al.,
2008; Toutanova et al., 2008), or explicitly diverse lists (Gimpel et
al., 2013); often the goal is to marginalize over structures to
calculate and minimize an expected loss function, as in minimum Bayes
risk decoding (Goodman, 1996; Kumar and Byrne, 2004), or to perform joint
inference between early and later stages of NLP analysis (e.g. Singh et
al., 2013; Durrett and Klein, 2014).

1 This is the extended version of a paper published in Proceedings of
EMNLP 2015. This version includes acknowledgments and an appendix. For
all materials, see: http:

These approaches should work better when the 
posterior probabilities of the predicted linguistic 
structures reflect actual probabilities of the struc¬ 
tures or aspects of the structures. For example, say 
a model is overconfident: it places too much prob¬ 
ability mass in the top prediction, and not enough 
in the rest. Then there will be little benefit to us¬ 
ing the lower probability structures, since in the 
training or inference objectives they will be incor¬ 
rectly outweighed by the top prediction (or in a 
sampling approach, they will be systematically un¬ 
dersampled and thus have too-low frequencies). If 
we only evaluate models based on their top pre¬ 
dictions or on downstream tasks, it is difficult to 
diagnose this issue. 

Instead, we propose to directly evaluate the calibration of a model’s
posterior prediction distribution. A perfectly calibrated model knows how
often it’s right or wrong; when it predicts an event with 80% confidence,
the event empirically turns out to be true 80% of the time. While perfect
accuracy for NLP models remains an unsolved challenge, perfect
calibration is a more achievable goal, since a model that has imperfect
accuracy could, in principle, be perfectly calibrated. In this paper, we
develop a method to empirically analyze calibration that is appropriate
for NLP models (§3) and use it to analyze common generative and
discriminative models for tagging and classification (§4).

Furthermore, if a model’s probabilities are 
meaningful, that would justify using its proba¬ 
bility distributions for any downstream purpose, 
including exploratory analysis on unlabeled data. 
In §6 we introduce a representative corpus explo¬ 
ration problem, identifying temporal event trends 
in international politics, with a method that is de¬ 
pendent on coreference resolution. We develop 
a coreference sampling algorithm (§5.2) which 
projects uncertainty into the event extraction, in¬ 
ducing a posterior distribution over event frequen¬ 
cies. Sometimes the event trends have very high 
posterior variance (large confidence intervals), 2 
reflecting when the NLP system genuinely does 
not know the correct semantic extraction. This 
highlights an important use of a calibrated model: 
being able to tell a user when the model’s predic¬ 
tions are likely to be incorrect, or at least, not giv¬ 
ing a user a false sense of certainty from an erro¬ 
neous NLP analysis. 

2 Definition of calibration 

Consider a binary probabilistic prediction problem, which consists of
binary labels and probabilistic predictions for them. Each instance has a
ground-truth label y ∈ {0, 1}, which is used for evaluation. The
prediction problem is to generate a predicted probability or prediction
strength q ∈ [0, 1]. Typically, we use some form of a probabilistic model
to accomplish this task, where q represents the model’s posterior
probability 3 of the instance having a positive label (y = 1).

Let S = {(q_1, y_1), (q_2, y_2), ..., (q_N, y_N)} be
the set of prediction-label pairs produced by the 
model. Many metrics assess the overall quality 
of how well the predicted probabilities match the 
data, such as the familiar cross entropy (negative 
average log-likelihood), 

\[ L_\ell(y, q) = \frac{1}{N} \sum_i \left[ y_i \log \frac{1}{q_i} + (1 - y_i) \log \frac{1}{1 - q_i} \right] \]

or mean squared error, also known as the Brier
score when y is binary (Brier, 1950),

\[ L_2(y, q) = \frac{1}{N} \sum_i (y_i - q_i)^2 \]

2 We use the terms confidence interval and credible interval
interchangeably in this work; the latter term is debatably more correct,
though less widely familiar.

3 Whether q comes from a Bayesian posterior or not is irrelevant to the
analysis in this section. All that matters is that predictions are
numbers q ∈ [0, 1].


Both tend to attain better (lower) values when q is
near 1 when y = 1, and near 0 when y = 0; and
they achieve a perfect value of 0 when all q_i = y_i. 4
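The two scoring rules above can be written down directly. The following is a minimal sketch, not the authors' code: the function names are hypothetical, predictions are assumed to arrive as a list of (q, y) pairs, and the clipping constant `eps` is our own addition to avoid taking the log of zero.

```python
import math

def cross_entropy(pairs, eps=1e-12):
    """Negative average log-likelihood of binary labels y under predictions q."""
    total = 0.0
    for q, y in pairs:
        q = min(max(q, eps), 1.0 - eps)  # clipping is an assumption, not in the text
        total += -math.log(q) if y == 1 else -math.log(1.0 - q)
    return total / len(pairs)

def brier_score(pairs):
    """Mean squared error between predictions q and binary labels y."""
    return sum((y - q) ** 2 for q, y in pairs) / len(pairs)

# Toy prediction-label pairs (illustrative values only).
example_pairs = [(0.9, 1), (0.2, 0), (0.6, 1), (0.7, 0)]
example_brier = brier_score(example_pairs)
example_xent = cross_entropy(example_pairs)
```

Both functions return lower values for better predictions, and both approach 0 when every q equals its label.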

Let P(y, q) be the joint empirical distribution
over labels and predictions. Under this notation,
L_2 = E_{q,y}[(y − q)^2]. Consider the factorization

\[ P(y, q) = P(y \mid q)\, P(q) \]

where P(y | q) denotes the label empirical frequency, conditional on a
prediction strength (Murphy and Winkler, 1987). 5 Applying this
factorization to the Brier score leads to the calibration-refinement
decomposition (DeGroot and Fienberg, 1983), in terms of expectations with
respect to the prediction strength distribution P(q):

\[ L_2 = \underbrace{E_q\left[(q - p_q)^2\right]}_{\text{Calibration MSE}} + \underbrace{E_q\left[p_q (1 - p_q)\right]}_{\text{Refinement}} \tag{1} \]

where we denote p_q = P(y = 1 | q) for brevity.

Here, calibration measures to what extent a model’s probabilistic
predictions match their corresponding empirical frequencies. Perfect
calibration is achieved when P(y = 1 | q) = q for all q; intuitively, if
you aggregate all instances where a model predicted q, a fraction q of
them should have y = 1. We define the magnitude of miscalibration using
root mean squared error:
Definition 1 (RMS calibration error).

\[ \mathrm{CalibErr} = \sqrt{E_q\left[(q - P(y = 1 \mid q))^2\right]} \]

The second term of Eq. 1 refers to refinement, which reflects to what
extent the model is able to separate different labels (in terms of the
conditional Gini entropy p_q (1 − p_q)). If the prediction strengths tend
to cluster around 0 or 1, the refinement score tends to be lower. The
calibration-refinement breakdown offers a useful perspective on the
accuracy of a model posterior. This paper focuses on calibration.
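As a numerical check of the decomposition in Eq. 1, the sketch below groups instances by their exact prediction value q, so that p_q is simply the label frequency within each group. This is an illustration under that assumption (the helper names are hypothetical); with continuous-valued predictions, grouping by exact value is not practical, which is what motivates the binning in §3.

```python
from collections import defaultdict

def brier(pairs):
    """Mean squared error between predictions q and binary labels y."""
    return sum((y - q) ** 2 for q, y in pairs) / len(pairs)

def calibration_refinement(pairs):
    """Return (calibration MSE, refinement), grouping instances by exact q."""
    groups = defaultdict(list)
    for q, y in pairs:
        groups[q].append(y)
    n = len(pairs)
    calibration = 0.0
    refinement = 0.0
    for q, ys in groups.items():
        p_q = sum(ys) / len(ys)      # empirical frequency of y = 1 given q
        weight = len(ys) / n         # empirical P(q)
        calibration += weight * (q - p_q) ** 2
        refinement += weight * p_q * (1.0 - p_q)
    return calibration, refinement

# Toy data: predictions of 0.8 are right 75% of the time, 0.3 right 25%.
toy = [(0.8, 1), (0.8, 1), (0.8, 0), (0.8, 1),
       (0.3, 0), (0.3, 1), (0.3, 0), (0.3, 0)]
cal, ref = calibration_refinement(toy)
```

On this toy data the two terms sum to the Brier score, and the calibration term is small because each group's frequency is close to its prediction.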

There are several other ways to break down squared error,
log-likelihood, and other probabilistic scoring rules. 6 We use the
Brier-based calibration error in this work, since unlike cross-entropy it
does not tend toward infinity when near probability 0; we hypothesize
this could be an issue since both p and q are subject to estimation
error.

4 These two loss functions are instances of proper scoring rules
(Gneiting and Raftery, 2007; Bröcker, 2009).

5 We alternatively refer to this as label frequency or empirical
frequency. The P probabilities can be thought of as frequencies from the
hypothetical population the data and predictions are drawn from. P
probabilities are, definitionally speaking, completely separate from a
probabilistic model that might be used to generate q predictions.

6 They all include a notion of calibration corresponding to a Bregman
divergence (Bröcker, 2009); for example, cross-entropy can be broken down
such that KL divergence is the measure of miscalibration.

Algorithm 1 Estimate calibration error using adaptive binning.

Input: A set of N prediction-label pairs
{(q_1, y_1), (q_2, y_2), ..., (q_N, y_N)}.

Output: Calibration error.

Parameter: Target bin size β.

Step 1: Sort pairs by prediction values q_k, in ascending order.

Step 2: For each k, assign bin label b_k = ⌊(k − 1)/β⌋ + 1.

Step 3: Define each bin B_i as the set of indices of pairs that have the
same bin label. If the last bin has size less than β, merge it with the
second-to-last bin (if one exists). Let {B_1, B_2, ..., B_T} be the set
of bins.

Step 4: Calculate empirical and predicted probabilities per bin:

\[ \hat{p}_i = \frac{1}{|B_i|} \sum_{k \in B_i} y_k \quad \text{and} \quad \hat{q}_i = \frac{1}{|B_i|} \sum_{k \in B_i} q_k \]

Step 5: Calculate the calibration error as the root mean squared error
per bin, weighted by bin size in case they are not uniformly sized:

\[ \mathrm{CalibErr} = \sqrt{\frac{1}{N} \sum_{i=1}^{T} |B_i| \, (\hat{q}_i - \hat{p}_i)^2} \]

3 Empirical calibration analysis 

From a test set of labeled data, we can analyze 
model calibration both in terms of the calibration 
error, as well as visualizing the calibration curve 
of label frequency versus predicted strength. How¬ 
ever, computing the label frequencies P (y = 1 \ q) 
requires an infinite amount of data. Thus approx¬ 
imation methods are required to perform calibra¬ 
tion analysis. 

3.1 Adaptive binning procedure 

Previous studies that assess calibration in supervised machine learning
models (Niculescu-Mizil and Caruana, 2005; Bennett, 2000) calculate label
frequencies by dividing the prediction space into deciles or other evenly
spaced bins (e.g. q ∈ [0, 0.1), q ∈ [0.1, 0.2), etc.) and then
calculating the empirical label frequency in each bin. This procedure may
be thought of as using a form of nonparametric regression (specifically,
a regressogram; Tukey 1961) to estimate the function
f(q) = P(y = 1 | q) from observed data points. But models in natural
language processing give very skewed distributions of confidence scores q
(many are near 0 or 1), so this procedure performs poorly, having much
more variable estimates near the middle of the q distribution (Figure 1).

Algorithm 2 Estimate calibration error’s confidence interval by sampling.

Input: A set of N prediction-label pairs
{(q_1, y_1), (q_2, y_2), ..., (q_N, y_N)}.

Output: Calibration error with a 95% confidence interval.

Parameter: Number of samples, S.

Step 1: Calculate {p̂_1, p̂_2, ..., p̂_T} from Step 4 of Algorithm 1.

Step 2: Draw S samples. For each s = 1..S,

• For each bin i = 1..T, draw p̂_i^(s) ~ N(p̂_i, σ̂_i^2), where
σ̂_i^2 = p̂_i (1 − p̂_i) / |B_i|. If necessary, clip to [0, 1]:
p̂_i^(s) := min(1, max(0, p̂_i^(s)))

• Calculate the sample’s CalibErr using the pairs (q̂_i, p̂_i^(s)) as per
Step 5 of Algorithm 1.

Step 3: Calculate the 95% confidence interval for the calibration error
as:

\[ \mathrm{CalibErr}_{\mathrm{avg}} \pm 1.96 \, s_{\mathrm{error}} \]

where CalibErr_avg and s_error are the mean and the standard deviation,
respectively, of the CalibErrs calculated from the samples.

We propose adaptive binning as an alternative. Instead of dividing the
interval [0, 1] into fixed-width bins, adaptive binning defines the bins
such that there are an equal number of points in each, after which the
same averaging procedure is used. This method naturally gives wider bins
to areas with fewer data points (areas that require more smoothing), and
ensures that these areas have roughly similar standard errors as those
near the boundaries, since for a bin with β points and empirical
frequency p̂, the standard error is estimated by √(p̂ (1 − p̂) / β),
which is bounded above by 0.5/√β. Algorithm 1 describes the procedure for
estimating calibration error using adaptive binning, which can be applied
to any probabilistic model that predicts posterior probabilities.
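The adaptive binning procedure of Algorithm 1 can be sketched in a few lines. This is an illustrative implementation rather than the authors' code: the function names are hypothetical, and predictions are assumed to be given as a list of (q, y) pairs with y in {0, 1}.

```python
import math

def adaptive_bins(pairs, beta):
    """Steps 1-3: sort by q, cut into consecutive bins of beta pairs,
    and merge a too-small last bin into the second-to-last one."""
    ordered = sorted(pairs, key=lambda pair: pair[0])
    bins = [ordered[k:k + beta] for k in range(0, len(ordered), beta)]
    if len(bins) > 1 and len(bins[-1]) < beta:
        last = bins.pop()
        bins[-1].extend(last)
    return bins

def bin_summaries(bins):
    """Step 4: per-bin mean prediction q_hat, label frequency p_hat, and size."""
    summaries = []
    for members in bins:
        size = len(members)
        q_hat = sum(q for q, _ in members) / size
        p_hat = sum(y for _, y in members) / size
        summaries.append((q_hat, p_hat, size))
    return summaries

def calib_err(summaries):
    """Step 5: root mean squared difference, weighted by bin size."""
    n = sum(size for _, _, size in summaries)
    total = sum(size * (q_hat - p_hat) ** 2 for q_hat, p_hat, size in summaries)
    return math.sqrt(total / n)

# Toy example with a target bin size of 2 (far smaller than is sensible in practice).
toy_pairs = [(0.9, 1), (0.1, 0), (0.3, 1), (0.2, 0), (0.8, 1)]
toy_bins = adaptive_bins(toy_pairs, beta=2)
toy_error = calib_err(bin_summaries(toy_bins))
```

With five pairs and β = 2, the leftover single pair is merged into the preceding bin, giving bins of sizes 2 and 3.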

3.2 Confidence interval estimation 

Especially when the test set is small, estimating calibration error may
be subject to error, due to uncertainty in the label frequency estimates.
Since how to estimate confidence bands for nonparametric regression is an
unsolved problem (Wasserman, 2006), we resort to a simple method based on
the binning. We construct a binomial normal approximation for the label
frequency estimate in each bin, and simulate from it; every simulation
across all bins is used to construct a calibration error; these simulated
calibration errors are collected to construct a normal approximation for
the calibration error estimate. Since we use bin sizes of at least
β ≥ 200 in our experiments, the central limit theorem justifies these
approximations. We report all calibration errors along with their 95%
confidence intervals calculated by Algorithm 2. 7

Figure 1: (a) A skewed distribution of predictions on whether a word has
the NN tag (§4.2.2). Calibration curves produced by equally-spaced
binning with bin width equal to 0.02 (b) and 0.1 (c) can have wide
confidence intervals. Adaptive binning (with 1000 points in each bin) (d)
gives small confidence intervals and also captures the prediction
distribution. The confidence intervals are estimated as described in
§3.2.
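The simulation in Algorithm 2 can be sketched as follows. This is a hedged illustration: the function name, the `seed` argument, and the input format (a list of per-bin tuples (q̂_i, p̂_i, |B_i|), such as Step 4 of Algorithm 1 would produce) are our assumptions, and the standard deviation used is the sample standard deviation, which the text does not specify.

```python
import math
import random
import statistics

def calib_err_interval(summaries, num_samples=1000, seed=0):
    """Return (mean, lower, upper) for the calibration error, by resampling
    each bin's label frequency from its normal approximation."""
    rng = random.Random(seed)
    n = sum(size for _, _, size in summaries)
    errors = []
    for _ in range(num_samples):
        total = 0.0
        for q_hat, p_hat, size in summaries:
            sigma = math.sqrt(p_hat * (1.0 - p_hat) / size)
            p_sample = rng.gauss(p_hat, sigma)
            p_sample = min(1.0, max(0.0, p_sample))  # clip to [0, 1]
            total += size * (q_hat - p_sample) ** 2
        errors.append(math.sqrt(total / n))
    mean = statistics.mean(errors)
    half_width = 1.96 * statistics.stdev(errors)
    return mean, mean - half_width, mean + half_width

# Two bins of 500 points each (illustrative numbers only).
mean_err, lower, upper = calib_err_interval([(0.10, 0.12, 500), (0.90, 0.85, 500)])
```

Because the routine only needs per-bin summaries, it can be reused with any binning scheme, not only the adaptive one.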

3.3 Visualizing calibration 

In order to better understand a model’s calibration properties, we plot
the pairs (p̂_1, q̂_1), (p̂_2, q̂_2), ..., (p̂_T, q̂_T) obtained from the
adaptive binning procedure to visualize the calibration curve of the
model; this visualization is known as a calibration or reliability plot.
It provides finer grained insight into the calibration behavior in
different prediction ranges. A perfectly calibrated curve would coincide
with the y = x diagonal line. When the curve lies above the diagonal, the
model is underconfident (q < p_q); and when it is below the diagonal, the
model is overconfident (q > p_q).

An advantage of plotting a curve estimated from 
fixed-size bins, instead of fixed-width bins, is that 
the distribution of the points hints at the refinement 
aspect of the model’s performance. If the points’ 
positions tend to cluster in the bottom-left and top- 
right corners, that implies the model is making 
more refined predictions. 
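Reading a reliability plot amounts to comparing each bin's mean prediction with its label frequency. The sketch below does this comparison without plotting; the function name and the `tolerance` parameter (a band around the diagonal within which a bin is treated as calibrated) are hypothetical conveniences, not part of the method described above.

```python
def describe_bins(points, tolerance=0.01):
    """Label each (p_hat, q_hat) bin relative to the y = x diagonal."""
    labels = []
    for p_hat, q_hat in points:
        if q_hat < p_hat - tolerance:
            labels.append("underconfident")  # curve lies above the diagonal
        elif q_hat > p_hat + tolerance:
            labels.append("overconfident")   # curve lies below the diagonal
        else:
            labels.append("calibrated")
    return labels

# Three illustrative bins, given as (label frequency, mean prediction).
curve_labels = describe_bins([(0.10, 0.02), (0.50, 0.50), (0.80, 0.95)])
```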

4 Calibration for classification and 
tagging models 

Using the method described in §3, we assess the quality of posterior
predictions of several classification and tagging models. In all of our
experiments, we set the target bin size in Algorithm 1 to be 5,000 and
the number of samples in Algorithm 2 to be 10,000.

7 A major unsolved issue is how to fairly select the bin size. If it is
too large, the curve is oversmoothed and calibration looks better than it
should be; if it is too small, calibration looks worse than it should be.
Bandwidth selection and cross-validation techniques may better address
this problem in future work. In the meantime, visualizations of
calibration curves help inform the reader of the resolution of a
particular analysis: if the bins are far apart, the data is sparse, and
the specific details of the curve are not known in those regions.

4.1 Naive Bayes and logistic regression 

4.1.1 Introduction 

Previous work on Naive Bayes has found its prob¬ 
abilities to have calibration issues, in part due 
to its incorrect conditional independence assump¬ 
tions (Niculescu-Mizil and Caruana, 2005; Ben¬ 
nett, 2000; Domingos and Pazzani, 1997). Since 
logistic regression has the same log-linear repre¬ 
sentational capacity (Ng and Jordan, 2002) but 
does not suffer from the independence assump¬ 
tions, we select it for comparison, hypothesizing 
it may have better calibration. 

We analyze a binary classification task of Twit¬ 
ter sentiment analysis from emoticons. We col¬ 
lect a dataset consisting of tweets identified by the 
Twitter API as English, collected from 2014 to 
2015, with the “emoticon trick” (Read, 2005; Lin 
and Kolcz, 2012) to label tweets that contain at 
least one occurrence of the smiley emoticon “:)” 
as “happy” (y = 1) and others as y = 0. The 
smiley emoticons are deleted in positive examples. 
We sampled three sets of tweets (subsampled from 
the Decahose/Gardenhose stream of public tweets) 
with Jan-Apr 2014 for training, May-Dec 2014 for 
development, and Jan-Apr 2015 for testing. Each 
set contains 10^5 tweets, split between an equal
number of positive and negative instances. We 
use binary features based on unigrams extracted 
from the twokenize.py 8 tokenization. We use the 
scikit-learn (Pedregosa et al., 2011) implementa¬ 
tions of Bernoulli Naive Bayes and L2-regularized 
logistic regression. The models’ hyperparameters (Naive Bayes’ smoothing
parameter and logistic regression’s regularization strength) are chosen
to maximize the F-1 score on the development set.

8 https://github.com/myleott/ark-twokenize-py

Figure 2: Calibration curve of (a) Naive Bayes and (b) logistic
regression on predicting whether a tweet is a “happy” tweet.

4.1.2 Results 

Naive Bayes attains a slightly higher F-1 score
(NB 73.8% vs. LR 72.9%), but logistic regression 
has much lower calibration error: less than half 
as much RMSE (NB 0.105 vs. LR 0.041; Figure 
2). Both models have a tendency to be undercon¬ 
fident in the lower prediction range and overconfi¬ 
dent in the higher range, but the tendency is more 
pronounced for Naive Bayes. 

4.2 Hidden Markov models and conditional 
random fields 

4.2.1 Introduction 

Hidden Markov models (HMM) and linear chain 
conditional random fields (CRF) are another com¬ 
monly used pair of analogous generative and dis¬ 
criminative models. They both define a posterior 
over tag sequences P(y|x), which we apply to 
part-of-speech tagging. 

We can analyze these models in the binary cal¬ 
ibration framework (§2-3) by looking at marginal 
distribution of binary-valued outcomes of parts of 
the predicted structures. Specifically, we examine 
calibration of predicted probabilities of individual 
tokens’ tags (§4.2.2), and of pairs of consecutive 
tags (§4.2.3). These quantities are calculated with 
the forward-backward algorithm. 
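For concreteness, the per-token tag marginals for an HMM can be computed as in the sketch below. It is a minimal, unscaled forward-backward pass suitable only for short sequences; the function names, the list-of-lists parameter layout, and the toy parameters are assumptions for illustration, not the models trained in this paper. Each marginal then becomes a prediction q, paired with a binary label indicating whether the gold tag matches.

```python
def forward_backward(init, trans, emit, observations):
    """Posterior marginals P(y_t = k | x) for an HMM.
    init[k], trans[j][k] and emit[k][word] are probabilities."""
    num_tags, length = len(init), len(observations)
    alpha = [[0.0] * num_tags for _ in range(length)]
    beta = [[1.0] * num_tags for _ in range(length)]
    for k in range(num_tags):
        alpha[0][k] = init[k] * emit[k][observations[0]]
    for t in range(1, length):
        for k in range(num_tags):
            incoming = sum(alpha[t - 1][j] * trans[j][k] for j in range(num_tags))
            alpha[t][k] = emit[k][observations[t]] * incoming
    for t in range(length - 2, -1, -1):
        for j in range(num_tags):
            beta[t][j] = sum(trans[j][k] * emit[k][observations[t + 1]] * beta[t + 1][k]
                             for k in range(num_tags))
    evidence = sum(alpha[-1])
    return [[alpha[t][k] * beta[t][k] / evidence for k in range(num_tags)]
            for t in range(length)]

def tag_prediction_pairs(marginals, gold_tags, tag):
    """(q, y) pairs for one tag: q is the marginal, y is 1 iff the gold tag matches."""
    return [(row[tag], int(gold == tag)) for row, gold in zip(marginals, gold_tags)]

# Toy two-tag HMM over a two-word vocabulary (illustrative numbers only).
toy_init = [0.6, 0.4]
toy_trans = [[0.7, 0.3], [0.4, 0.6]]
toy_emit = [[0.5, 0.5], [0.1, 0.9]]
toy_marginals = forward_backward(toy_init, toy_trans, toy_emit, [0, 1, 1])
toy_tag_pairs = tag_prediction_pairs(toy_marginals, [0, 1, 1], tag=1)
```

Marginals over pairs of consecutive tags can be obtained from the same forward and backward tables; that extension is omitted here.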

To prepare a POS tagging dataset, we extract Wall Street Journal articles
from the English CoNLL-2011 coreference shared task dataset from
Ontonotes (Pradhan et al., 2011), using the CoNLL-2011 splits for
training, development and testing. This results in 11,772 sentences for
training, 1,632 for development, and 1,382 for testing, over a set of 47
possible tags.

We train an HMM with Dirichlet MAP using one pseudocount for every
transition and word emission. For the CRF, we use the L2-regularized
L-BFGS algorithm implemented in CRFsuite (Okazaki, 2007). We compare an
HMM to a CRF that only uses basic transition (tag-tag) and emission
(tag-word) features, so that it does not have an advantage due to more
features. In order to compare models with similar task performance, we
train the CRF with only 3000 sentences from the training set, which
yields the same accuracy as the HMM (about 88.7% on the test set). In
each case, the model’s hyperparameters (the CRF’s L2 regularizer, the
HMM’s pseudocount) are selected by maximizing accuracy on the development
set.

Figure 3: Calibration curves of (a) HMM, and (b) CRF, on predictions over
all POS tags.

4.2.2 Predicting single-word tags 

In this experiment, we measure miscalibration of 
the two models on predicting tags of single words. 
First, for each tag type, we produce a set of 33,306 
prediction-label pairs (for every token); we then
concatenate them across the tags for calibration 
analysis. Figure 3 shows that the two models 
exhibit distinct calibration patterns. The HMM 
tends to be very underconfident whereas the CRF 
is overconfident, and the CRF has a lower (better) 
overall calibration error. 

We also examine the calibration errors of the 
individual POS tags (Figure 4(a)). We find that 
CRF is significantly better calibrated than HMM 
in most but not all categories (39 out of 47). For 
example, they are about equally calibrated on pre¬ 
dicting the NN tag. The calibration gap between 
the two models also differs among the tags. 

4.2.3 Predicting two-consecutive-word tags 

There is no reason to restrict ourselves to model 
predictions of single words; these models define 
marginal distributions over larger textual units. 
Next we examine the calibration of posterior pre¬ 
dictions of tag pairs on two consecutive words in 
the test set. The same analysis may be impor¬ 
tant for, say, phrase extraction or other chunk¬ 
ing/parsing tasks. 














Figure 4: Calibration errors of HMM and CRF on predict¬ 
ing (a) single-word tags and (b) two-consecutive-word tags. 
Lower errors are better. The last two columns in each graph 
are the average calibration errors over the most common la¬ 
bels. 

We report results for the top 5 and 100 most fre¬ 
quent tag pairs (Figure 4(b)). We observe a simi¬ 
lar pattern as seen from the experiment on single 
tags: the CRF is generally better calibrated than 
the HMM, but the HMM does achieve better cali¬ 
bration errors in 29 out of 100 categories. 

These tagging experiments illustrate that, de¬ 
pending on the application, different models can 
exhibit different levels of calibration. 

5 Coreference resolution 

We examine a third model, a probabilistic model 
for within-document noun phrase coreference, 
which has an efficient sampling-based inference 
procedure. In this section we introduce it and ana¬ 
lyze its calibration, in preparation for the next sec¬ 
tion where we use it for exploratory data analysis. 

5.1 Antecedent selection model 

We use the Berkeley coreference resolution system (Durrett and Klein,
2013), which was originally presented as a CRF; we give it an equivalent
formulation as a series of independent logistic regressions (see appendix
for details). The primary component of this model is a locally-normalized
log-linear distribution over clusterings of noun phrases, each cluster
denoting an entity. The model takes a fixed input of N mentions (noun
phrases), indexed by i in their positional order in the document. It
posits that every mention i has a latent antecedent selection decision,
a_i ∈ {1, ..., i − 1, NEW}, denoting which previous mention it attaches
to, or NEW if it is starting a new entity that has not yet been seen at a
previous position in the text. Such a mention-mention attachment
indicates coreference, while the final entity clustering includes more
links implied through transitivity. The model’s generative process is:

Definition 2 (Antecedent coreference model and sampling algorithm).

• For i = 1..N, sample

\[ a_i \sim P(a_i \mid x) \propto \exp\left(w^\top f(i, a_i, x)\right) \]

• Calculate the entity clusters as e := CC(a), the connected components
of the antecedent graph having edges (i, a_i) for i where a_i ≠ NEW.

Here x denotes all information in the document that is conditioned on for
log-linear features f. e = {e_1, ..., e_M} denotes the entity clusters,
where each element is a set of mentions. There are M entity clusters
corresponding to the number of connected components in a. The model
defines a joint distribution over antecedent decisions
P(a | x) = ∏_i P(a_i | x); it also defines a joint distribution over
entity clusterings P(e | x), where the probability of an e is the sum of
the probabilities of all a vectors that could give rise to it. In a
manner similar to a distance-dependent Chinese restaurant process (Blei
and Frazier, 2011), it is non-parametric in the sense that the number of
clusters M is not fixed in advance.

5.2 Sampling-based inference 

For both calibration analysis and exploratory ap¬ 
plications, we need to analyze the posterior distri¬ 
bution over entity clusterings. This distribution is 
a complex mathematical object; an attractive ap¬ 
proach to analyze it is to draw samples from this 
distribution, then analyze the samples. 

This antecedent-based model admits a very straightforward procedure to
draw independent e samples, by stepping through Def. 2: independently
sample each a_i, then calculate the connected components of the resulting
antecedent graph. By construction, this procedure samples from the joint
distribution of e (even though we never compute the probability of any
single clustering e).
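The two steps of Def. 2 can be sketched as follows. This is an illustration, not the Berkeley system's implementation: the function names are hypothetical, mentions are indexed from 0 rather than 1, NEW is represented by `None`, and the log-linear scores w^T f(i, a_i, x) are assumed to be precomputed and passed in as one dictionary per mention.

```python
import math
import random

NEW = None  # marker for "this mention starts a new entity"

def sample_antecedents(scores, rng):
    """scores[i] maps each candidate antecedent of mention i (an earlier
    mention index, or NEW) to its unnormalized log score."""
    antecedents = []
    for candidates in scores:
        options = list(candidates)
        weights = [math.exp(candidates[option]) for option in options]
        antecedents.append(rng.choices(options, weights=weights, k=1)[0])
    return antecedents

def clusters_from_antecedents(antecedents):
    """Connected components of the antecedent graph, computed with union-find."""
    parent = list(range(len(antecedents)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, antecedent in enumerate(antecedents):
        if antecedent is not NEW:
            parent[find(i)] = find(antecedent)
    groups = {}
    for i in range(len(antecedents)):
        groups.setdefault(find(i), set()).add(i)
    return sorted(groups.values(), key=min)

# One clustering sample for three mentions with made-up scores.
toy_scores = [{NEW: 0.0}, {0: 1.0, NEW: 0.0}, {0: 0.5, 1: 0.5, NEW: 1.0}]
toy_clusters = clusters_from_antecedents(sample_antecedents(toy_scores, random.Random(0)))
```

Each call draws an independent clustering, so repeated calls yield independent samples from the distribution over e without computing the probability of any particular clustering.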

Unlike approximate sampling approaches, such as Markov chain Monte Carlo
methods used in other coreference work to sample e (Haghighi and Klein,
2007), here there are no questions about burn-in or autocorrelation (Kass
et al., 1998). Every sample is independent and very fast to compute: only
slightly slower than calculating the MAP assignment (due to the exp and
normalization for each a_i). We implement this algorithm by modifying the
publicly available implementation from Durrett and Klein. 9

Figure 5: Coreference calibration curve for predicting whether two
mentions belong to the same entity cluster.

5.3 Calibration analysis 

We consider the following inference query: for a 
randomly chosen pair of mentions, are they coref¬ 
erent? Even if the model’s accuracy is compara¬ 
tively low, it may be the case that it is correctly 
calibrated—if it thinks there should be great vari¬ 
ability in entity clusterings, it may be uncertain 
whether a pair of mentions should belong together. 

Let ℓ_ij be 1 if the mentions i and j are predicted to be coreferent, and
0 otherwise. Annotated data defines a gold-standard ℓ_ij^(g) value for
every pair i, j. Any probability distribution over e defines a marginal
Bernoulli distribution for every proposition ℓ_ij, marginalizing out e:

\[ P(\ell_{ij} = 1 \mid x) = \sum_{e} \mathbb{1}\{(i, j) \in e\} \, P(e \mid x) \tag{2} \]

where (i, j) ∈ e is true iff there is an entity in e that contains both i
and j.

In a traditional coreference evaluation of the best-prediction entity
clustering, the model assigns 1 or 0 to every ℓ_ij and the pairwise
precision and recall can be computed by comparing them to the
corresponding ℓ_ij^(g). Here, we instead compare the
q_ij = P(ℓ_ij = 1 | x) prediction strengths against empirical frequencies
to assess pairwise calibration, with the same binary calibration analysis
tools developed in §3 by aggregating pairs with similar q_ij values. Each
q_ij is computed by averaging over 1,000 samples, simply taking the
fraction of samples where the pair (i, j) is coreferent.
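The Monte Carlo estimate of each pairwise prediction strength is just a co-occurrence frequency across sampled clusterings. A minimal sketch, assuming each sample is given as a list of sets of 0-based mention indices (the function name and this representation are hypothetical):

```python
from collections import Counter
from itertools import combinations

def pairwise_coref_probs(sampled_clusterings, num_mentions):
    """Estimate q_ij as the fraction of samples in which mentions i and j
    (with i < j) fall in the same entity cluster."""
    counts = Counter()
    for clusters in sampled_clusterings:
        for cluster in clusters:
            for i, j in combinations(sorted(cluster), 2):
                counts[(i, j)] += 1
    total = len(sampled_clusterings)
    return {(i, j): counts[(i, j)] / total
            for i, j in combinations(range(num_mentions), 2)}

# Four sampled clusterings of three mentions (illustrative only).
toy_samples = [[{0, 1}, {2}], [{0, 1, 2}], [{0}, {1}, {2}], [{0, 1}, {2}]]
toy_q = pairwise_coref_probs(toy_samples, num_mentions=3)
```

Pairing each estimated q_ij with its gold-standard label gives the prediction-label pairs that the binning analysis of §3 takes as input.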

9 Berkeley Coreference Resolution System, version 1.1:
http://nlp.cs.berkeley.edu/projects/coref.shtml


We perform this analysis on the development section of the English
CoNLL-2011 data (404 documents). Using the sampling inference method
discussed in §5.2, we compute 4.3 million prediction-label pairs and
measure their calibration error. Our result shows that the model produces
very well-calibrated predictions with less than 1% CalibErr (Figure 5),
though it is slightly overconfident on middle to high-valued predictions.
The calibration error indicates that it is the best-calibrated model we
examine within this paper. This result suggests we might be able to trust
its level of uncertainty.

6 Uncertainty in Entity-based 
Exploratory Analysis 

6.1 Entity-syntactic event aggregation 

We demonstrate one important use of calibration analysis: to ensure the
usefulness of propagating uncertainty from coreference resolution into a
system for exploring unannotated text. Accuracy cannot be calculated
since there are no labels; but if the system is calibrated, we postulate
that uncertainty information can help users understand the underlying
reliability of aggregated extractions and isolate predictions that are
more likely to contain errors.

We illustrate with an event analysis application 
to count the number of “country attack events”: 
for a particular country of the world, how many 
news articles describe an entity affiliated with that 
country as the agent of an attack, and how does 
this number change over time? This is a simpli¬ 
fied version of a problem where such systems have 
been built and used for political science analysis 
(Schrodt et al., 1994; Schrodt, 2012; Leetaru and 
Schrodt, 2013; Boschee et al., 2013; O’Connor 
et al., 2013). A coreference component can im¬ 
prove extraction coverage in cases such as “Rus¬ 
sian troops were sighted ... and they attacked ...” 

We use the coreference system examined in §5 for this analysis. To
propagate coreference uncertainty, we re-run event extraction on multiple
coreference samples generated from the algorithm described in §5.2,
inducing a posterior distribution over the event counts. To isolate the
effects of coreference, we use a very simple syntactic dependency system
to identify affiliations and events. Assume the availability of
dependency parses for a document d, a coreference resolution e, and a
lexicon of country names, which contains a small set of words w(c) for
each country c; for example, w(FRA) = {france, french}. The binary
function f(c, e; x_d) assesses whether an entity e is affiliated with
country c and is described as the agent of an attack, based on document
text and parses x_d; f returns true iff both: 10

• There exists a mention i ∈ e described as country c: either its head
word is in w(c) (e.g. “Americans”), or its head word has an nmod or amod
modifier in w(c) (e.g. “American forces”, “president of the U.S.”); and
there is only one unique country c among the mentions in the entity.

• There exists a mention j ∈ e which is the nsubj or agent argument to
the verb “attack” (e.g. “they attacked”, “the forces attacked”,
“attacked by them”).
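The two conditions above can be sketched as a small predicate. Everything about the data representation here is an assumption made for illustration: each mention is a dictionary with a `head` word, a list of nmod/amod `modifiers`, and a precomputed `is_attack_agent` flag, and the lexicon entries other than w(FRA) are made up. The actual extraction rules are described in the paper's appendix and are not reproduced here.

```python
# Hypothetical lexicon w(c); only the FRA entry comes from the text.
TOY_LEXICON = {
    "FRA": {"france", "french"},
    "RUS": {"russia", "russian", "russians"},
}

def mention_countries(mention, lexicon):
    """Countries whose lexicon words match the mention's head or its modifiers."""
    words = {mention["head"].lower()} | {m.lower() for m in mention["modifiers"]}
    return {country for country, names in lexicon.items() if words & names}

def is_country_attacker(country, entity, lexicon):
    """f(c, e; x_d): the entity is affiliated with exactly one country, that
    country is `country`, and some mention is the agent of an attack."""
    countries = set()
    for mention in entity:
        countries |= mention_countries(mention, lexicon)
    affiliated = countries == {country}
    attacks = any(mention["is_attack_agent"] for mention in entity)
    return affiliated and attacks

# "Russian troops were sighted ... and they attacked ..." as one entity.
toy_entity = [
    {"head": "troops", "modifiers": ["Russian"], "is_attack_agent": False},
    {"head": "they", "modifiers": [], "is_attack_agent": True},
]
toy_result = is_country_attacker("RUS", toy_entity, TOY_LEXICON)
```

The example shows why coreference matters for coverage: the affiliation and the attack come from different mentions, so the predicate is true only if the two mentions are clustered together.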

For a given c, we first calculate a binary variable for whether there is
at least one entity fulfilling f in a particular document,

\[ a(d, c, e_d) = \bigvee_{e \in e_d} f(c, e; x_d) \tag{3} \]

and second, the number of such documents in d(t), the set of New York
Times articles published in a given time period t,

\[ n(t, c, e_{d(t)}) = \sum_{d \in d(t)} a(d, c, e_d) \tag{4} \]

These quantities are both random variables, since they depend on e; thus
we are interested in the posterior distribution of n, marginalizing out
e,

\[ P\left(n(t, c, e_{d(t)}) \mid x_{d(t)}\right) \tag{5} \]

If our coreference model was highly certain (only 
one structure, or a small number of similar struc¬ 
tures, had most of the probability mass in the space 
of all possible structures), each document would 
have an a posterior near either 0 or 1, and their 
sum in Eq. 5 would have a narrow distribution. But 
if the model is uncertain, the distribution will be 
wider. Because of the transitive closure, the prob¬ 
ability of a is potentially more complex than the 
single antecedent linking probability between two 
mentions—the affiliation and attack information 
can propagate through a long coreference chain. 
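
A minimal sketch of how Eqs. 3 and 4 could be evaluated on posterior samples, assuming each document comes with the same number of sampled entity clusterings. The nested-list layout `doc_samples[d][s]`, the function names, and the toy predicate are assumptions made for illustration. Because documents are sampled independently, pairing the s-th sample of every document gives one joint sample of the count n.

```python
def a(entities, c, f):
    """Eq. 3: does at least one entity in this clustering satisfy f for country c?"""
    return any(f(c, e) for e in entities)

def n_samples(doc_samples, c, f):
    """Eq. 4, evaluated once per sample index s.

    doc_samples[d][s] is the s-th sampled clustering (a list of entities)
    for document d; every document must have the same number of samples.
    """
    num_samples = len(doc_samples[0])
    return [
        sum(a(doc[s], c, f) for doc in doc_samples)
        for s in range(num_samples)
    ]

# Toy stand-in for f: an entity is a set of tokens, and it "fires" when the
# country code and the word "attack" ended up in the same cluster.
def toy_f(c, entity):
    return c in entity and "attack" in entity

docs = [
    # document 0: the second sample splits the entity, so a = 0 there
    [[{"USA", "attack"}], [{"USA"}, {"attack"}]],
    # document 1: both samples agree
    [[{"USA", "attack"}], [{"USA", "attack"}]],
]
counts = n_samples(docs, "USA", toy_f)  # one value of n per posterior sample
```

The spread of the values in `counts` is what the credible intervals summarize: the split in document 0 lowers the count in one sample but not the other.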

6.2 Results 

We tag and parse a 193,403 article subset of the 
Annotated New York Times LDC corpus (Sand- 
haus, 2008), which includes articles about world 

10 Syntactic relations are Universal Dependencies 
(de Marneffe et al., 2014); more details for the extrac¬ 
tion rules are in the appendix. 


news from the years 1987 to 2007 (details in ap¬ 
pendix). For each article, we run the coreference 
system to predict 100 samples, and evaluate f on
every entity in every sample. 11 The quantity of 
interest is the number of articles mentioning at¬ 
tacks in a 3-month period (quarter), for a given 
country. Figure 6 illustrates the mean and 95% 
posterior credible intervals for each quarter. The 
posterior mean m is calculated as the mean of the 
samples, and the interval is the normal approxima¬ 
tion m ± 1.96 s, where s is the standard deviation 
among samples for that country and time period. 
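
The interval computation can be sketched as below. The function name is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which estimator was used.

```python
import statistics

def normal_interval(samples, z=1.96):
    """Posterior mean m and the normal-approximation interval m ± z * s."""
    m = statistics.mean(samples)
    s = statistics.stdev(samples)  # assumption: sample standard deviation
    return m, (m - z * s, m + z * s)

# Example: five sampled counts for one country and quarter (made-up values).
mean, (lower, upper) = normal_interval([1, 2, 3, 4, 5])
```

Since the normal approximation is symmetric, the lower bound can fall below zero for small counts even though a count cannot be negative.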

Uncertainty information helps us understand 
whether a difference between data points is real. 
In the plots of Figure 6, if we had used a 1-best 
coreference resolution, only a single line would be 
shown, with no assessment of uncertainty. This 
is problematic in cases when the model genuinely 
does not know the correct answer. For example, 
the 1993-1996 period of the USA plot (Figure 6, 
top) shows the posterior mean fluctuating from 1 
to 5 documents; but when credible intervals are 
taken into consideration, we see that the model does
not know whether the differences are real, or were 
caused by coreference noise. 

A similar case is highlighted at the bottom plot 
of Figure 6. Here we compare the event counts 
for Yugoslavia and NATO, which were engaged in 
a conflict in 1999. Did the New York Times de¬ 
vote more attention to the attacks by one particu¬ 
lar side? To a 1-best system, the answer would be 
yes. But the posterior intervals for the two coun¬ 
tries’ event counts in mid-1999 heavily overlap, 
indicating that the coreference system introduces 
too much uncertainty to obtain a conclusive an¬ 
swer for this question. Note that calibration of the 
coreference model is important for the credible in¬ 
tervals to be useful; for example, if the model was 
badly calibrated by being overconfident (too much 
probability over a small set of similar structures), 
these intervals would be too narrow, leading to in¬ 
correct interpretations of the event dynamics. 

Visualizing this uncertainty gives richer information
for a potential user of an NLP-based system,
compared to simply drawing a line based on
a single 1-best prediction. It preserves the genuine
uncertainty due to ambiguities the system was
unable to resolve. This highlights an alternative
use of Finkel et al. (2006)’s approach of sampling
multiple NLP pipeline components, which in that
work was used to perform joint inference. Instead

11 We obtained similar results using only 10 samples. We 
also obtained similar results with a different query function, 
the total number of entities, across documents, that fulfill f.






of focusing on improving an NLP pipeline, we can 
pass uncertainty on to exploratory purposes, and 
try to highlight to a user where the NLP system 
may be wrong, or where it can only imprecisely 
specify a quantity of interest. 

Finally, calibration can help error analysis. For 
a calibrated model, the more uncertain a predic¬ 
tion is, the more likely it is to be erroneous. While 
coreference errors comprise only one part of event 
extraction errors (alongside issues in parse qual¬ 
ity, factivity, semantic roles, etc.), we can look at 
highly uncertain event predictions to understand 
the nature of coreference errors relative to our 
task. We manually analyzed documents with a 
50% probability to contain an “attack”ing country- 
affiliated entity, and found difficult coreference 
cases. 

In one article from late 1990, an “attack” event 
for IRQ is extracted from the sentence “But some 
political leaders said that they feared that Mr Hus¬ 
sein might attack Saudi Arabia”. The mention 
“Mr. Hussein” is classified as IRQ only when it 
is coreferent with a previous mention “President 
Saddam Hussein of Iraq”; this occurs only 50% 
of the time, since in some posterior samples the 
coreference system split apart these two “Hussein” 
mentions. This particular document is addition¬ 
ally difficult, since it includes the names of more 
than 10 countries (e.g. United States, Saudi Ara¬ 
bia, Egypt), and some of the Hussein mentions are 
even clustered with presidents of other countries 
(such as “President Bush”), presumably because 
they share the “president” title. These types of er¬ 
rors are a major issue for a political analysis task; 
further analysis could assess their prevalence and 
how to address them in future work. 

7 Conclusion 

In this work, we argue that the calibration of pos¬ 
terior predictions is a desirable property of prob¬ 
abilistic NLP models, and that it can be directly 
evaluated. We also demonstrate a use case of 
having calibrated uncertainty: its propagation into 
downstream exploratory analysis. 

Our posterior simulation approach for ex¬ 
ploratory and error analysis relates to posterior 
predictive checking (Gelman et al., 2013), which
analyzes a posterior to test model assumptions; 
Mimno and Blei (2011) apply it to a topic model. 

One avenue of future work is to investigate 
more effective nonparametric regression methods 
to better estimate and visualize calibration error, 
such as Gaussian processes or bootstrapped kernel
density estimation.

Figure 6: Number of documents with an “attack”ing country
per 3-month period, and coreference posterior uncertainty
for that quantity. The dark line is the posterior mean, and
the shaded region is the 95% posterior credible interval. See
appendix for more examples.

Another important question is: what types of in¬ 
ferences are facilitated by correct calibration? In¬ 
tuitively, we think that overconfidence will lead 
to overly narrow confidence intervals; but in what 
sense are confidence intervals “good” when cal¬ 
ibration is perfect? Also, does calibration help 
joint inference in NLP pipelines? It may also assist 
calculations that rely on expectations, such as in¬ 
ference methods like minimum Bayes risk decod¬ 
ing, or learning methods like EM, since calibrated 
predictions imply that calculated expectations are 
statistically unbiased (though the implications of 
this fact may be subtle). Finally, it may be in¬ 
teresting to pursue recalibration methods, which 
readjust a non-calibrated model’s predictions to 
be calibrated; recalibration methods have been de¬ 
veloped for binary (Platt, 1999; Niculescu-Mizil 
and Caruana, 2005) and multiclass (Zadrozny and 
Elkan, 2002) classification settings, but we are 
unaware of methods appropriate for the highly 
structured outputs typical in linguistic analysis. 
Another approach might be to directly constrain 
CalibErr = 0 during training, or try to reduce it
as a training-time risk minimization or cost objective
(Smith and Eisner, 2006; Gimpel and Smith,
2010; Stoyanov et al., 2011; Brummer and Doddington,
2013).

Calibration is an interesting and important prop¬ 
erty of NLP models. Further work is necessary to 
address these and many other questions. 
1 Sampling a deterministic function of a
random variable 

In several places in this paper, we define proba¬ 
bility distributions over deterministic functions of 
a random variable, and sample from them by ap¬ 
plying the deterministic function to samples of the 
random variable. This should be valid by con¬ 
struction, but we supply the following argument 
for further justification. 

X is a random variable and g(x) is a deterministic
function which takes a value of X as its input.
Since g depends on a random variable, g(X)
is a random variable as well. The distribution
for g(X), or aspects of it (such as a PMF or in¬ 
dependent samples from it) can be calculated by 
marginalizing out X with a Monte Carlo approx¬ 
imation. Assuming g has discrete outputs (as is 
the case for the event counting function n, or con¬ 
nected components function CC), we examine the 
probability mass function: 

pmf(h) = P(g(X) = h)    (6)

= \sum_x P(g(x) = h \mid x) P(x)    (7)

= \sum_x 1\{g(x) = h\} P(x)    (8)

\approx \frac{1}{S} \sum_{x \sim P(X)} 1\{g(x) = h\}    (9)

Eq. 8 holds because g(x) is a deterministic func¬ 
tion, and Eq. 9 is a Monte Carlo approximation 
that uses S samples from P(x). 

This implies that a set of g values calculated on
x samples, {g(x^{(s)}) : x^{(s)} ~ P(x)}, should constitute
a sample from the distribution P(g(X)); in
our event analysis section we usually call this the
“posterior” distribution of g(X) (the n(t, c) function
there). In our setting, we do not directly use
the PMF calculation above; instead, we construct
normal approximations to the probability distribution
of g(X).
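
For illustration only, the Monte Carlo estimate in Eq. 9 can be sketched as follows. The function names and the toy variable X (two fair coin flips, with g their sum) are assumptions chosen so that the exact answer is known: P(g(X) = 0, 1, 2) = 0.25, 0.5, 0.25.

```python
import random
from collections import Counter

def mc_pmf(g, sampler, S, rng):
    """Estimate pmf(h) = P(g(X) = h) by applying g to S samples of X (Eq. 9)."""
    counts = Counter(g(sampler(rng)) for _ in range(S))
    return {h: k / S for h, k in counts.items()}

def two_coin_flips(rng):
    """Toy random variable X: a pair of independent fair coin flips."""
    return (rng.randint(0, 1), rng.randint(0, 1))

# g is the built-in sum, a deterministic function of the sampled pair.
pmf = mc_pmf(sum, two_coin_flips, S=20000, rng=random.Random(0))
```

The same pattern applies when X is a sampled coreference structure and g is the event counting function: only the sampler and g change.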

We use this technique in several places. For the
calibration error confidence interval, the calibration
error is a deterministic function of the uncertain
empirical label frequencies \hat{p}_i; there, we propagate
posterior uncertainty from a normal approximation
to the Bernoulli parameter’s posterior (the
\hat{p}_i distribution under the central limit theorem)
through simulation. In the coreference model, the
connected components function is a deterministic
function of the antecedent vector; thus repeatedly
calculating e^{(s)} := CC(a^{(s)}) yields samples
of entity clusterings from their posterior. For the
event analysis, the counting function n(t, c, e_{d(t)})
is a function of the entity samples, and thus can be
recalculated on each; this is a multiple-step deterministic
pipeline, which postprocesses simulated
random variables.
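
A minimal sketch of the connected components step, assuming the antecedent vector is encoded as a list in which position i holds either the index of an earlier mention or `None` when mention i starts a new entity. This encoding and the function name are assumptions for illustration.

```python
def connected_components(antecedents):
    """Map an antecedent vector to entity clusters (lists of mention indices).

    antecedents[i] is None if mention i starts a new entity, otherwise the
    index of an earlier mention (assumed to be < i) that it links to.
    """
    cluster_of = []  # cluster id assigned to each mention so far
    clusters = []    # clusters[k] lists the mentions in cluster k
    for i, a_i in enumerate(antecedents):
        if a_i is None:
            cluster_of.append(len(clusters))
            clusters.append([i])
        else:
            k = cluster_of[a_i]  # the antecedent precedes i, so its cluster is known
            cluster_of.append(k)
            clusters[k].append(i)
    return clusters

# Mentions 1 and 3 chain back to mention 0; mention 4 links to mention 2.
example = connected_components([None, 0, None, 1, 2])
```

Because each mention links only to an earlier one, a single left-to-right pass is enough to take the transitive closure; mention 3 joins mention 0's entity through mention 1.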

As in other Monte Carlo-based inference techniques
(as applied to both Bayesian and frequentist
(e.g. bootstrapping) inference), the mean and standard
deviation of samples drawn from the distribution
constitute the mean and standard deviation of
the desired posterior distribution, subject to Monte
Carlo error due to the finite number of samples,
which by the central limit theorem shrinks at a rate
of 1/\sqrt{S}. The Monte Carlo standard error for estimating
the mean is \sigma/\sqrt{S}, where \sigma is the standard
deviation. So with 100 samples, the Monte
Carlo standard error for the mean is \sqrt{100} = 10
times smaller than the standard deviation. Thus in the
time series graphs, which are based on S = 100
samples, the posterior mean (dark line) has Monte
Carlo uncertainty that is 10 times smaller than the
vertical gray area (95% CI) around it.
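
The \sigma/\sqrt{S} rate can be checked numerically with a small simulation. The sketch below is not part of the original experiments: it assumes a normal distribution with \sigma = 2 and repeats the S = 100 sampling procedure 2000 times to see how much the estimated mean itself varies.

```python
import random
import statistics

rng = random.Random(0)
S = 100        # samples per estimate, as in the time series graphs
sigma = 2.0    # assumed standard deviation of the sampled quantity

# Repeat the "draw S samples and take their mean" procedure many times.
batch_means = [
    statistics.mean([rng.gauss(0.0, sigma) for _ in range(S)])
    for _ in range(2000)
]

observed_se = statistics.stdev(batch_means)  # spread of the estimated means
expected_se = sigma / S ** 0.5               # sigma / sqrt(S) = 0.2
```

The observed spread of the batch means should land close to 0.2, which is one tenth of \sigma.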

2 Normalization in the coreference 
model 

Durrett and Klein (2013) present their model as a
globally normalized, but fully factorized, CRF:

P(a \mid x) = \frac{1}{Z} \prod_i \exp(w^T f(i, a_i, x))

Since the factor function decomposes independently
for each random variable a_i, their probabilities
are actually independent, and can be rewritten
with local normalization,

P(a \mid x) = \prod_i \frac{1}{Z_i} \exp(w^T f(i, a_i, x))

This interpretation justifies the use of independent
sampling to draw samples of the joint posterior.
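
The equivalence of the two normalizations can be checked on a small example. In the sketch below, `score_table[i][k]` stands for the score w^T f(i, a_i = k, x) of the k-th candidate antecedent of mention i; the table layout, the function names, and the toy scores are all assumptions for illustration, not the Durrett and Klein implementation.

```python
import itertools
import math
import random

def local_probs(scores):
    """Locally normalized distribution over one mention's antecedent candidates."""
    mx = max(scores)                              # subtract max for numerical stability
    exps = [math.exp(s - mx) for s in scores]
    Z_i = sum(exps)
    return [e / Z_i for e in exps]

def joint_prob_local(score_table, assignment):
    """P(a | x) as a product of per-mention locally normalized terms."""
    return math.prod(
        local_probs(scores)[a_i] for scores, a_i in zip(score_table, assignment)
    )

def joint_prob_global(score_table, assignment):
    """P(a | x) with one global Z, by brute-force enumeration (tiny inputs only)."""
    def unnorm(joint):
        return math.exp(sum(scores[a_i] for scores, a_i in zip(score_table, joint)))
    all_joint = itertools.product(*(range(len(s)) for s in score_table))
    return unnorm(assignment) / sum(unnorm(j) for j in all_joint)

def sample_antecedents(score_table, rng):
    """Draw each a_i independently from its local distribution."""
    return [
        rng.choices(range(len(scores)), weights=local_probs(scores))[0]
        for scores in score_table
    ]

toy_scores = [[0.0], [1.0, -0.5], [0.2, 0.3, 2.0]]
one_sample = sample_antecedents(toy_scores, random.Random(0))
```

On `toy_scores`, `joint_prob_local` and `joint_prob_global` agree up to floating-point error for every assignment, which is the factorization Z = \prod_i Z_i in numerical form.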

3 Event analysis: Corpus selection, 
country affiliation, and parsing 

Articles are filtered to yield a dataset about world 
news. In the New York Times Annotated Corpus, 
every article is tagged with a large set of labels. 
We include articles that contain a category whose
label starts with the string Top/News/World and
whose text body contains a mention of at least one
country name, and exclude articles with any category
matching the regex /(Sports|Opinion).
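
The filter can be sketched as below. Several details are assumptions: the leading slash is treated as part of the regular expression, the exclusion pattern is searched anywhere in a category label, and a country mention is detected by a case-insensitive whole-word search. The function name and the example labels are hypothetical.

```python
import re

EXCLUDE = re.compile(r"/(Sports|Opinion)")  # assumed reading of the exclusion regex

def keep_article(categories, body, country_names):
    """Apply the world-news filter to one article.

    categories: list of category label strings attached to the article
    body: article text
    country_names: iterable of country name strings from the lexicon
    """
    is_world = any(c.startswith("Top/News/World") for c in categories)
    is_excluded = any(EXCLUDE.search(c) for c in categories)
    mentions_country = any(
        re.search(r"\b" + re.escape(name) + r"\b", body, re.IGNORECASE)
        for name in country_names
    )
    return is_world and not is_excluded and mentions_country
```

Under these assumptions, an article labeled Top/News/World/Europe whose body mentions France is kept, while the same article carrying an additional label that contains /Opinion is dropped.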

Country names are taken from the dictionary
country_igos.txt based on previous work
(http://brenocon.com/irevents/). Country
name matching is case insensitive and uses light
stemming: when trying to match a word against
the lexicon, if a match is not found, it backs off to
stripping the last and last two characters. (This is
usually unnecessary since the dictionary contains
modifier forms.)
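
A minimal sketch of the matching procedure, assuming the lexicon has been loaded into a dictionary from lowercased surface forms to country codes. The dictionary contents and the function name are hypothetical; the actual file format of the dictionary is not described here.

```python
def match_country(word, lexicon):
    """Return the country code for a word, or None if nothing matches.

    Tries the lowercased word first, then backs off to the word with its
    last character removed, then with its last two characters removed.
    """
    w = word.lower()
    for candidate in (w, w[:-1], w[:-2]):
        if candidate in lexicon:
            return lexicon[candidate]
    return None

# Hypothetical lexicon entries for illustration.
toy_lexicon = {"france": "FRA", "french": "FRA", "iraq": "IRQ"}
```

With this toy lexicon, "Iraqi" matches after stripping one character and "Iraqis" after stripping two, while "Germany" returns None.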

POS, NER, and constituent and dependency 
parses are produced with Stanford CoreNLP 3.5.2
with default settings except for one change, to use 
its shift-reduce constituent parser (for convenience 
of processing speed). We treat tags and parses as 
fixed and leave their uncertainty propagation for 
future work. 

When formulating the extraction rules, we ex¬ 
amined frequencies of all syntactic dependencies 
within country-affiliated entities, in order to help 
find reasonably high-coverage syntactic relations 
for the “attack” rule. 

4 Event time series graphs 

The following pages contain posterior time series 
graphs for 20 countries, as described in the sec¬ 
tion on coreference-based event aggregation, in or¬ 
der of decreasing total event frequency. As in the 
main paper, the blue line indicates the posterior 
mean, and the gray region indicates 95% posterior 
credibility intervals, with count aggregation at the 
monthly level. The titles are ISO3 country codes.